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Abstract 

Nonparametric Bayesian approaches based on Gaussian processes have recently become 
popular in the empirical learning community. They encompass many classical methods 
of statistics, like Radial Basis Functions or various splines, and are technically convenient 
because Gaussian integrals can be calculated analytically. Restricting to Gaussian processes, 
however, forbids for example the implemention of genuine nonconcave priors. Mixtures of 
Gaussian process priors, on the other hand, allow the flexible implementation of complex 
and situation specific, also nonconcave a priori information. This is essential for tasks 
with, compared to their complexity, a small number of available training data. The paper 
concentrates on the formalism for Gaussian regression problems where prior mixture models 
provide a generalisation of classical quadratic, typically smoothness related, regularisation 
approaches being more flexible without having a much larger computational complexity. 
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1 Introduction 

The generalisation behaviour of statistical learning algorithms relies essentially on the correctness 
of the implemented a priori information. While Gaussian processes and the related regularisation 
approaches have, on one hand, the very important advantage of being able to formulate a priori 

*This is an extended version of a contribution to the Ninth International Conference on Artificial Neural 
Networks (ICANN 99), 7-10 September 1999, Edinburgh, UK. 
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information explicitly in terms of the function of interest (mainly in the form of smoothness 
priors which have a long tradition in density estimation and regression problems ||l8 
implement, on the other hand, only simple concave prior densities corresponding to quadratic 
errors. Especially complex tasks would require typically more general prior densities. Choosing 
mixtures of Gaussian process priors combines the advantage of an explicit formulation of priors 
with the possibility of constructing general non-concave prior densities. 

While mixtures of Gaussian processes are technically a relatively straightforward extension 
of Gaussian processes, which turns out to be a computational advantage, practically they are 
much more flexible and are able to produce in principle, i.e., in the limit of infinite number of 
components, any arbitrary prior density. 

As example, consider an image completion task, where an image have to be completed, 
given a subset of pixels ('training data'). Simply requiring smoothness of grey level values 
would obviously not be sufficient if we expect, say, the image of a face. In that case the prior 
density should reflect that a face has specific constituents (e.g., eyes, mouth, nose) and relations 
(e.g., typical distances between eyes) which may appear in various variations (scaled, translated, 
deformed, varying lightening conditions). 

While ways how prior mixtures can be used in such situations have already been outlined in 

H H' ^^^^ paper concentrates on the general formalism and technical aspects of mixture 
models and aims in showing their computational feasibility. Sections ^|-^ provide the necessary 
formulae while Section ^ exemplifies the approach for an image completion task. 

Finally, we remark that mixtures of Gaussian process priors do usually not result in a (finite) 
mixture of Gaussians |^ for the function of interest. Indeed, in density estimation, for example, 
arbitrary densities not restricted to a (finite) mixture of Gaussians can be produced by a mixture 
of Gaussian prior processes. 
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2 The Bayesian model 

Let us consider the following random variables: 

1. X, representing (a vector of) independent, visible variables ('measurement situations'), 

2. y, being (a vector of) dependent, visible variables ('measurement results'), and 

3. h, being the hidden variables ('possible states of Nature'). 
A Bayesian approach is based on two model inputs |ll|, ^, p^ : 

1. A likelihood model p{y\x, h), describing the density of observing y given x and h. Regarded 
as function of h, for fixed y and x, the density p{y\x, h) is also known as the (x-conditional) 
likelihood of h. 

2. A prior model p{h\DQ), specifying the a priori density of h given some a priori information 
denoted by Dq (but before training data Dt have been taken into account). 

Furthermore, to decompose a possibly complicated prior density into simpler components, 
we introduce continuous hyperparameters 9 and discrete hyperparameters j (extending the set of 
hidden variables to h — {h,6,j)), 

pih\Do)^ JdeJ2pih,0,j\Do). (1) 

j 

In the following, the summation over j will be treated exactly, while the ^-integral will be 
approximated. A Bayesian approach aims in calculating the predictive density for outcomes y in 
test situations x 

p{y\x,D)= Jdhp{y\x,h)p{h\D), (2) 

given data D ~ {Dt, Dq} consisting of a priori data Dq and i.i.d. training data Dt = {{xi, yi)|l < 
i < n}. The vector of all Xi (yi) will be denoted xt {yr)- Fig.|l] shows a graphical representation 
of the considered probabilistic model. 
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Figure 1: Graphical representation of the considered probabilistic model, factorising accord- 
ing to p{xT,yT,x,y,h,e,j,{P)\D) = p{xt) p{x) p{yT\xT,h,{(3)) p{y\x,h,{P)) p{h\e, j, Dq, {(3)) 
p{9,j, {/3)\Do). (The variable P is introduced in Section H.) Circles indicate visible variables. 

In saddle point approximation (maximum a posteriori approximation) the /i-integral becomes 

p{y\x,D)^piy\x,h*), (3) 

h* = argmax^g^p(/i|£)), (4) 

assuming p{y\x, h) to be slowly varying at the stationary point. The posterior density is related 
to (xT^conditional) likelihood and prior according to Bayes' theorem 

piyT\xT,h)p{h\Do) 



p{h\D) 



p{yT\xT,Do) 



(5) 



where the /i-independent denominator (evidence) can be skipped when maximising with respect 
to h. Treating the 0-integral within p{h\D) also in saddle point approximation the posterior 
must be maximised with respect to h and simultaneously . 

3 Gaussian regression 

In general density estimation problems p{yi\xi,h) is not restricted to a special form, provided 
it is non-negative and normalised j^, |lOj . In this paper we concentrate on Gaussian regression 
where the single data likelihoods are assumed to be Gaussians 



p{yi\x.,,h) = y — e 



(6) 



In that case the unknown regression function h(x) represents the hidden variables and h— 
integration means functional integration j dh ^ J Yl^ dh{x). 

As simple building blocks for mixture priors we choose Gaussian (process) prior components 



pih\p,e,j,D„) = ( j- 



xe 



(detK,(0))^ 

-i{h~t,{e),j^,{e){h~t,{e))) 



(7) 



the scalar product notation (•, •) standing for x-integration. The mean tj{9){x) will in the 
following also be called an (adaptive) template function. Covariances K~^//3 are real, symmetric, 
positive (semi-) definite (for positive semidefinite covariances the null space has to be projected 
out). The dimension d of the /i-integral becomes infinite for an infinite number of x-values (e.g. 
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continuous x) . The infinite factors appearing thus in numerator and denominator of however 
cancel. Common smoothness priors have tj{9) = and as Kj a differential operator, e.g., the 
negative Laplacian. 

Analogously to simulated annealing it will appear to be very useful to vary the 'inverse 
temperature' /? simultaneously in (^) (for training but not necessarily for test data) and (0). 
Treating (3 not as a fixed variable, but including it explicitly as hidden variable, the formulae of 
Sect. Ijremain valid, provided the replacement h (h, (3) is made, e.g. p{yi\xi^ h) — > p(jji\xi, h, P) 
(see also Fig.|l|). 

Typically, inverse prior covariances can be related to approximate symmetries. For example, 
assume we expect the regression function to be approximately invariant under a permutation of 
its arguments h{x) « h{a{x)) with a denoting a permutation. Defining an operator S acting on 
h according to Sh{x) — h(a{x)), we can define a prior process with inverse covariance 

K = (I-Sf (I-S), (8) 

with identity I and the superscript denoting the transpose of an operator. The corresponding 
prior energy 

Eo = ^{h,Kh)^^(^{h-S)h,{h-S)hy (9) 

is a measure of the deviation of h from an exact symmetry under S. Similarly, we can consider 
a Lie group S = e^** with s being the generator of the infinitesimal symmetry transformation. 
In that case a covariance 

K = 1(1 - Si„f)^(I - Sinf) = s^s, (10) 



02' 

with prior energy 



Eo = ^{sh,sh), (11) 



can be used to implement approximate invariance under the infinitesimal symmetry transfor- 
mation Sinf — 1 + 9s. For appropriate boundary conditions, a negative Laplacian K can thus 
be interpreted as enforcing approximate invariance under infinitesimal translations, i.e., for s = 
d/dx. 

4 Prior mixtures 
4.1 General formalism 

Decomposed into components the posterior density becomes 

/m 
deY,P{VT\xT.h,P) (12) 

3 

xp{h\(3,e,j,Do)p{(3,e,j\Do). 

Writing probabilities in terms of energies, including parameter dependent normalisation factors 
and skipping parameter independent factors yields 

piyT\xT,h,P) oc e-/3ST+tin0 

pih\P,0,j,Do) - e-PEo^^H^-^ (13) 

p{f3,9,j\Do) cx e-^«.''-. 
This defines hyperprior energies Eg^pj, prior energies i?o,j ('quadratic concepts') 



^0. 



\(ii-t,{e),¥.,{e){h^t,{e,j))), (u) 
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(the generalisation to a sum of quadratic terms Eqj — J^k ^o,k,j is straightforward) and training 
or hkehhood energy (training error) 

n 

■i 

= \((h^ tT, Krih - It)) + J2 Mx^)^ ■ 
The second hne is a 'bias- variance' decomposition where 

tH-.) = E^, (16) 



k ^ 



is the mean of the training data available for x^, and 



n. 

k 



is the variance of values at Xi. (Vi vanishes if every Xi appears only once.) The diagonal 
matrix Kt is restricted to the space of x for which training data are available and has matrix 
elements Hx- 



4.2 Maximum a posteriori approximation 

In general density estimation the predictive density can only be calculated approximately, e.g. 
in maximum a posteriori approximation or by Monte Carlo methods. For Gaussian regression, 
however the predictive density of mixture models can be calculated exactly for given 9 (and 
P). This provides us with the opportunity to compare the simultaneous maximum posterior 
approximation with respect to h and 9 with an analytical /i-integration followed by a maximum 
posterior approximation with respect to 9. 

Maximising the posterior (with respect to h, 9, and possibly (3) is equivalent to minimising 
the mixture energy (regularised error functional 1^, |l^ ) 

m 

E=-lnJ2e-^^+'^, (18) 



with component energies 



and 



Ej = f3Ehj + Eg^pj, Eh J — Et + Eqj, (19) 



?,/3) = -In detKj (61) + In /3. (20) 



ilndetK,(0) + -2 

In a direct saddle point approximation with respect to h and 9 stationarity equations are 
obtained by setting the (functional) derivatives with respect to h and 9 to zero. 



Hi 

=Ea,(KT(/i-iT)+K,(/^-i,)), (21) 



] \ / 

where the derivatives with respect to 9 are matrices if is a vector, 

a, = p{j\h,9,Do) (23) 

^-l3Eoj-Eg,pj + ^ln dct Kj 
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and 



dE, dEe,p,, , Jdt, 



de de ' \de 



Eg. (pi|) can be rewritten 



with 



+ '^{{h-t,),^{h-t,)). (24) 
h = K-i |^Kt<t + £ ajK.t,^ , (25) 
K,= |Kt + X]%kJ . (26) 



Due to the presence of ^.-dependent factors Oj, Eg. (psj) is stiU a nonhnear eguation for h(x). 
For the sake of simpUcity we assumed a fixed /?; it is no problem however to solve (|l]) and ( p2[ ) 
simultaneously with an analogous stationarity eguation for /3. 

4.3 Analytical solution 

The optimal regression function under sguared-error loss — for Gaussian regression identical to 
the log-loss of density estimation — is the predictive mean. For mixture model (|2|) one finds, 
say for fixed 

y = j dyyp{y\x,D) = E /'^^ b,{e)t,{e), (27) 

j 

with mixture coefficients 

bj{0) = p{0,j\D) (28) 

_ p{d,j)p{yT\xT,Do,0,j) 



j:jdep{e,j)p{yT\xT,Do,e,j)- 

The component means tj and the likelihood of 9 can be calculated analytically 

= t,+Kj'K,{tT-t,), (29) 

and _ _ 

piyT\xT,Do,e,j) = e-^^o- + ^i"dct(iLK,)^ (30) 

where 



EoAO) = \{tT-tj,K,{tT-tj)), (31) 
K,{e) = (K^i+K7^^)-i, (32) 

and K.J^rp is the projection of the covariance into the n-dimensional space for which 

training data are available, {n < n is the number of data with distinct x-values.) 

The stationarity eguation for a maximum a posteriori approximation with respect to 9 is at 
this stage found from ( p8|j3C| ) 

where Ej = PEqj + Eg^i^j. Notice that Eg.(|3^) differs from Eg. (p^) and reguires only to deal 
with the n x n-matrix K. The coefficient b* = bj{9*) for 9 set to its maximum posterior value 
is of form ( p3| ) with the replacements Kj Kj , Ej Ej . 
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4.4 High and low temperature limits 

Low and high temperature hmits are extremely useful because in both cases the stationarity 
Eq.(^) becomes linear, corresponding thus to classical quadratic regularisation approaches. 
In the high temperature limit (3 —>■ the exponential factors Uj become /i-independent 



«i ' «J - ^mg-i3,,^,, + ilndotK, ' ^'^'^^ 

(for b* — > 6°'* replace Kj by Kj). The solution h = t is a (generalised) 'complete template 
average' 

t = Kj (KTtT + £ a ■ K,t,^ , (35) 

with 

K„o =Kt + X1«jKj- (36) 

j 

This high temperature solution corresponds to the minimum of the quadratic functional Et + 
In the low temperature limit (3 oo only the maximal component contributes, i.e., 

a, '^af^ll ■ '^"^"^'^^l^^' , (37) 

^ J i : J ^ argmmj £;,,j- 

(for b* replace E^^j by Ej) assuming i?/3,e,j = i5/3 + E^^j or i?/3,ej = i?/3 + Ej +(}Eg. Hence, 
low temperature solutions h = tj, are all (generalised) 'component averages' tj provided they 
fulfil the stability condition 

EhAh = t,)<Eh,y{h = tj), VjVj, (38) 

or, after performing a (generalised) 'bias- variance' decomposition, 2Vj < Bji{j,j) + 2Vji , with 
m X m matrices 



g-Ee,i3,j + ^ Indct Kj 



,{k, I) = (t, - t„ {Kd + K,) {ii - t,)) (39) 
and (generalised) 'template variances' 

Vj = l({tT,KTtT) + {t,,K,t,) 

- (f„(KT + K,)t-)^ =^0,,. (40) 

That means single component averages tj (which minimise E^j and thus —f3Ej +Cj) become so- 
lutions at zero temperature in case their (generalised) variance Vj measuring the discrepancy 
between data and prior term is small enough. 



4.5 Equal covariances 

Especially interesting are ^/-independent Kj(0) = Ko(0) with ^-independent determinants so 
detKj or detKj, respectively, do not have to be calculated. 

Notice that this still allows completely arbitrary parameterisations of tj{d). Thus, the tem- 
plate function can for example be a parameterised model, e.g., a neural network or decision tree, 
and maximising the posterior with respect to 9 corresponds to training that model. In such 
cases the prior term forces the maximum posterior solution h to be similar (as defined by Kq) 
to this trained parameterised reference model. 

The condition of invariant detKo(0) does not exclude adaption of covariances. For example, 
transformations for real, symmetric positive definite Ko(0) leaving determinant and eigenval- 
ues (but not eigenvectors) invariant are of the form K(0o) K(6') = O(0)KO^^(0) with real. 
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Figure 2: Left: Example of a solution space for m — 3. Shown are three low temperature 
solutions tj, high temperature solution i, and a possible solution h at finite f3. Right: Exact bi 
vs. (dominant) oi (dashed) for m = 2, 6 = 2, Ei = 0.405, E2 = 0.605. 




Figure 3: Shown are the plots of /i(ai) = ai and /2(ai) = 5 (tanh A + 1) within the inverse 

temperature range < /3 < 4 (for b — 2, E2 — Ei = 0.1(3). Notice the appearance of a second 
stable solution at low temperatures. 

orthogonal = O^. This allows for example to adapt the sensible directions of multidimen- 
sional Gaussians. A second kind of transformations changing eigenvalues but not eigenvectors 
and determinant is of the form K(6'o) = OD(6lo)0^ K(6') 0D(6')0^ if the product of 
eigenvalues of the real, diagonal ^{Oo) and D{9) are equal. 

Eqs.( p9| , ^5| ) show that the high temperature solution becomes a linear combination of the 
(potential) low temperature solutions 
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(41) 



Similarly, Eq.(|2l|) simplifies to 



(42) 



and Eq.(||) to 



e 2' 



bje 2 



raBka—Ek 



(43) 



introducing vector a with components Uj, m x m matrices Bj defined in (p9|). Eq.(p2[) is still a 
nonlinear equation for h, it shows however that the solutions must be convex combinations of the 
/i-independent tj (see Fig. Thus, it is sufficient to solve Eq.(43) for m mixture coefficients 
Qj instead of Eq.(pll) for the function h. 

For two prior components, i.e., m — 2, Eq.(^2|) becomes 



with 



/.= ^ + (tanhA)^, 



E2 - El f3 E2 - El 



(44) 
(45) 
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Figure 4: Top row, from left to right: Data points sampled with Gaussian noise, two template 
functions ti, i2- Bottom row, from left to right: Original, reconstructed solutions (regression 
function h, 180x240 pixels) at low and at high temperature. 

because the matrices Bj are in this case zero except Bi{2, 2) = i?2(l, 1) = b. For Eg p j uniform 
in j we have (fi + 12)/2 = f so that a° = 0.5. The stationarity Eg. (^3|) , being analogous to the 
celebrated mean field equation of a ferromagnet, can be solved graphically (see Fig.^ and Fig.^ 
for a comparison with bj), the solution is given by the point where 



fli = - (tanhA + 1) . 



(46) 



5 A numerical example 



As numerical example we study a two component mixture model for image completion. Assume 
we expect an only partially known image (corresponding to pixel-wise training data drawn 
with Gaussian noise from the original image) to be similar to one of the two template images 
shown in Fig.^. Next, we include hyperparameters parameterising deformations of templates. In 
particular, we have chosen translations {0i, 62) a scaling factor 63, and a rotation angle (around 
template center) ^4. 

Interestingly, it turned out that due to the large number of data {fi « 1000) it was easier to 
solve Eq.(|ll) for the full discretized image than to invert (|3^ ) in the space of training data. A 
prior operator Kq has been implemented as a 3 x 3 negative Laplacian filter. (Notice that using a 
Laplacian kernel, or another smoothness measure, instead of a straight template matching using 
simply the squared error between image and template, leads to a smooth interpolation between 
data and templates.) Completed images h for different [3 have been found by iterating according 
to 

^fe+i = h*' + ■qA-^\^T{tT - h^) + ^o{Y.a''jtj ~ h^y\^, (47) 

j 

performed alternating with ^-minimisation. A Gaussian learning matrix A^^ (implemented by 
a 5 X 5 binomial filter) proved to be successful. Typically, the relaxation factor 77 has been set 
to 0.05. 

Being a mixture model with m = 2 the situation is that of Fig.|[ Typical solutions for large 
and small /3 are shown in Fig.^ 

6 Conclusions 

Prior mixture models are capable to build complex prior densities from simple, e.g., Gaussian 
components. Going beyond classical quadratic regularisation approaches, they still can use 
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the nice analytical features of Gaussians, and allow to control the degree of the resulting non- 
convexity explicitly. Combined with parameterised component mean functions and covariances 
they seem to provide a powerful tool. 
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